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Abstract. Cosmological analyses can be accelerated by approximating slow calculations 
using a training set, which is either precomputed or generated dynamically. However, this 
approach is only safe if the approximations are well understood and controlled. This paper 
surveys issues associated with the use of machine-learning based emulation strategies for 
accelerating cosmological parameter estimation. We describe a learn-as-you-go algorithm 
that is implemented in the CoSMO-|--|- code^ and (1) trains the emulator while simultaneously 
estimating posterior probabilities; (2) identifies unreliable estimates, computing the exact 
numerical likelihoods if necessary; and (3) progressively learns and updates the error model 
as the calculation progresses. We explicitly describe and model the emulation error and show 
how this can be propagated into the posterior probabilities. We apply these techniques to the 
Planck likelihood and the calculation of ACDM posterior probabilities. The computation is 
significantly accelerated without a pre-defined training set and uncertainties in the posterior 
probabilities are subdominant to statistical fluctuations. We have obtained a speedup factor 
of 6.5 for Metropolis-Hastings and 3.5 for nested sampling. Finally, we discuss the general 
requirements for a credible error model and show how to update them on-the-fly. 


^http://cosmopp.com 
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1 Introduction 

Constraining cosmological models using experimental data typically requires many evalua¬ 
tions of the data likelihood, which can be very slow. For instance, the analysis of galaxy 
surveys typically requires an estimate of the non-linear matter power spectrum and using 
A-body simulations to achieve the desired accuracy would be computationally prohibitive. 
Even calculating data likelihoods for cosmic microwave background (CMB) data, such as 
WMAP [1] and Planck [2], requires the angular power spectra, which must be obtained 
through the evolution of Boltzmann codes like Camb [3] and Class [4, 5] that take signih- 
cant fractions of a second to calculate for each set of parameters in a long Markov chain. 

A common approach to speeding up computationally expensive calculations is to employ 
a statistical model that emulates an expensive calculation using a training set from which 
results are empirically approximated.^ Several such algorithms for cosmological data analy¬ 
sis have been described, and associated codes released [7-10]. Recombination and the CMB 
likelihood calculations are emulated by RiCO [11] and PiCO [12, 13] respectively, while the 
likelihoods themselves are cached and re-used for interpolation with InterpMC [14]. The 

^Following the statistics literature, e.g., Ref. [6], we use the term emulator to refer to a statistical or 
deterministic model that approximates the output of a simulation code in a computationally efficient fashion. 
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Figure 1. The relationship between different classes in the object-oriented implementation of our 
learn-as-you-go algorithm. An arrow pointing from one block to another indicates the first class 
“knows about” or uses the output of the second. 

Coyote Universe suite [15-21] emulates the matter power spectrum using Gaussian pro¬ 
cesses and Af-body simulations, with significant attention placed on the optimal construction 
of training sets. 

Although emulators can significantly accelerate calculations, it can be difficult to esti¬ 
mate and control the errors they induce. Consequently, emulators can only be used if the 
methodology is known to be highly reliable. Two obvious but unattractive solutions to this 
dilemma are to make direct comparisons with the final exact calculation or to build very 
dense training sets. Further, because the specific regions of parameter space to be explored 
may not be known in advance, the training must either span a large region of parameter 
space or be extended dynamically in order to ensure accuracy. Obviously, the advantages of 
emulation are eroded if significant computational resources are needed to train the emulator 
or the uncertainty associated with the emulation is not well-controlled. 

An approach to solving this problem was demonstrated in Refs [14, 22-24], where an 
external statistical sampling method was coupled to a machine learning algorithm to dynam¬ 
ically rehne a statistical emulator. This results in a training set that is dense only in areas of 
parameter space where it is needed, while simultaneously evaluating a model’s posterior. We 
call methods of this sort learn-as-you-go emulators, since they both accelerate a calculation 
that would be performed in any case and build a training set for future use. 

In this paper we introduce a flexible and powerful learn-as-you-go algorithm that in¬ 
corporates an adaptive and trainable error model for the differences between emulated and 
“exact” cosmological likelihood evaluations.^ We use well-tested sampling methods to simul¬ 
taneously evaluate model posteriors, generate the training set, and update the error model 
by cross-validation. Critically, the local emulation errors are propagated through the calcu¬ 
lation, allowing us to estimate the error in the posterior probability p(x | D) for data D and 
parameter x induced by our use of emulation. Consequently, the uncertainty in the posteriors 
is modelled explicitly, quantifying the robustness of the estimated posterior. 

Cosmological parameter estimation using our algorithm has five major components: 

1. Sampler. — An externally defined method to explore parameter space. 

2. Data likelihood. — The pre-defined function that we intend to emulate. 

^A fully documented implementation of the algorithm is included in COSMOH—h [25], available at 
http://cosmopp.com. The code is modular and could easily be applied to similar problems in other fields. In 
addition, we provide a fast version of the Planck likelihood module that uses this algorithm (see Sect. 6 and 
Appendix B for further details). 
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3. Emulator. — Given a training set, perform a rapid emulation for new parameter values 
of interest. 

4. Error model. — For every new point, estimate the error made by emulating. Propa¬ 
gate the errors in the likelihood from the emulation process into errors for the posterior 
distributions. 

5. Learn-as-you-go. — Continuously update the training set of the emulator based on 
new exact evaluations. Also, continuously update the error model. Determine whether 
the emulated value is acceptable or if the exact calculation should be performed. 

Fig. 1 describes the relationship between the different classes in our publicly available, object- 
oriented C-|—|- implementation for parameter space sampling and likelihood evaluations. 

Our techniques are not restricted to cosmology and can be easily applied to any com¬ 
putationally expensive calculation that depends on a fixed number of input parameters. The 
algorithm does not make strong assumptions about the properties of the parameter space 
and is usable with a wide variety of statistical samplers. Frequently repeated operations take 
at most O(logA^tr) time, where Atr is the size of the training set, so that the speed of the 
algorithm is only weakly dependent on At^. Furthermore, the cost of evaluating the error 
model and adding new points to the training set is negligibly small for typical calculations. 
Our implementation scales well in an MPI environment, with all processes sharing a single 
training set. Unlike the approach of Ref. [14] we do not assume that the parameter space is 
simply connected or has a single, global maximum. 

Following Ref. [14], we use the emulated value of the data likelihood for any set of pa¬ 
rameters for which the emulation error is below a user-defined threshold; otherwise the exact 
calculation is performed and its result added to the training set. Defining an unacceptable 
level of error between emulated and non-emulated values lets the user tradeoff between ac¬ 
curacy and acceleration, assuming that the error model is sufficiently robust. This approach 
ensures that there are no redundancies in the training set; new points are not added where 
they are not needed and no time is spent generating training data that are not used. In 
common with the techniques described in Refs [14, 23], we do not need an initial training 
set, but providing one can signihcantly increase the resulting acceleration. 

We demonstrate the use of the algorithm by accelerating estimates of posterior proba¬ 
bilities for ACDM with CMB likelihoods. CMB analyses provide a useful testing platform for 
our methods because they can be conducted in a reasonable time but are still slow enough 
that any improvement in speed will be extremely welcome. Starting with an empty training 
set we use the emulator with two parameter space samplers: the Metropolis-Hastings [26] 
MCMC method implemented in CoSMO-| — h [25] and nested sampling with the MultiNest 
package [27-29]. We track the number of exact likelihood calculations required with and 
without emulation, as well as the overall speedup. We demonstrate that the use of the em¬ 
ulator results in negligibly small errors in the posterior distributions of the parameters as 
determined by both our pre-defined error model and the exact posteriors obtained without 
emulation. 

2 Emulation Algorithm 

While cosmological applications will be primarily concerned with emulating the likelihood 
function ^{D\x.), for observational data D and model parameters x, in general we could 
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Symbol 

Description 

Symbol 

Description 

P 

Probability distribution func¬ 
tion 

D 

Data 


Data likelihood 

f 

Exact function intended for 
emulation 

fern 

Emulating function that ap¬ 
proximates / 

n 

Number of dimensions in pa¬ 
rameter space (domain of /) 

X 

Parameter space vector 

Strain 

Training set of parameter 
space points 

^rain 

Range of training set 

/(Attain) 

m 

Number of dimensions for the 
range of / 

p 

Model-dependent measure 

of local sparsity of training 
points 

k 

Number of nearest neighbors 
used in analysis 

w 

Weighting of nearest neigh¬ 
bors 

C 

Covariance matrix of training 
set 

L 

Cholesky decomposed covari¬ 
ance C = LL^ 

<5 

The degree of the polynomial 
interpolation for /em 

A/ 

Local emulation error, i.e., 
difference between /em(x) and 
/(x) 

Cfj 

Local emulation error on arbi¬ 
trary scalar function r/ 

Ap 

r 

Global emulation error in pos¬ 
terior probability 

Ratio of Brj to p 

On 

Nuisance parameters 


Table 1. A legend for our notation. 


emulate any arbitrary function /(x). In this Section we will keep our discussion as general 
as possible, and will introduce an explicit local error model in Sect. 3. 

Suppose we need to calculate the function 

/ : M*" ^ (2.1) 

a large number of times with a specified accuracy. The exact numerical evaluation of / is 
assumed to be slow, so we wish to store previously calculated results in order to rapidly 
estimate the value of / for new input points x E M”. The emulated value of / is denoted by 

fern- 

A detailed schematic of the emulation algorithm is given in Alg. 1 and the implemen¬ 
tation of our publicly available code is discussed in Appendix B. In particular, the k-d tree 
used to locate the points used for the emulation is described in B.l and the emulator imple¬ 
mentation is discussed in B.2. We expand on each of these functions below: 

InitializeEmul. — This function builds and saves a k-d tree from the training set Xtrain 
following Ref. [30] . We linearly transform the input points x —>• x' for a new basis in which 
all of the parameters in the transformed training set are uncorrelated and where all of the 
components for each x' have the same order of magnitude. To achieve this, we calculate the 
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Algorithm 1 Emulates a function / that maps x E —>■ y E M™', given a training set 

consisting of parameter space points {xj} and their mappings {yj = /(xj)}. 

function lNITIALIZEEMUL(Atrain = {xj : Z = 1, Atr},Etrain = {Y* ■ i = 1, Atr}) 

Calculate the sample covariance matrix C of points x* in Atrain 
Cholesky decompose C into matrices L by C = LL^ 

Build training set in new basis = {L“^Xj : i = 1, Atr} 

Build k-d tree T^d from 

function Emulate(xo) 

Change basis by defining Xq = L“^xo 

Using Tkd, find k nearest neighbors = {x^^ : i = 1, A:} of Xq 
F ind the corresponding = {yf'^ = /(x^^) : i = l,k} from Atrain 
for all nearest neighbors x^^ E A^^ do 

Calculate weighting tc* = l/Distance(xQ, x^^) for some metric 

for all j = 1, m do 

Assume (/em(x))j = -Pj(x') around Xq where Pj is a polynomial 
Find coefficients of Pj to minimize Sj = ^*1 Aj(x^^) — P 

return /em(x) = (-Pi(xo), ..., Em(xo)) 
function ADDPoiNTToTRAiNSET(x,y) 

Add X to Atrain and y to Atrain 
Add x' = L-^x to A^ain and T^d 
if Depth(rfcrf) > 41og Atr then 
Rebalance Tkd 

procedure SimpleEmulator 

Calculate training sets Atrain and Atrain = /(Atrain) exactly 
Set up the emulator by calling InitializeEmul(A train, Strain) 

Emulate /(x) by calling Emulate(x) 


sample correlation matrix, C, of the points in the training set, find its Cholesky decomposition 
matrix C = LL^, and use L to change the basis to x' = L“^x. 

Emulate. — This function computes /em for an input point xq, finding the k nearest neigh¬ 
bors (A;-NN) of X in the A:-d tree built from Atrain- These near-by points in the training set 
are then interpolated and the value of the polynomial interpolation at xq is used for /em(xo)- 
As the k-d tree in InitializeEmul stores input points in the Cholesky basis, xq is first 
transformed to that basis with xq —>■ Xg, after which the k nearest neighbors are located 
using the k-d tree. Given we are working with dimensionless, normalized variables we use 
the Euclidean distance to put a metric on the parameter space, but other possibilities exist 
including the Fisher information metric. In general, /(x) = (/i,...,/^) with m > 1, so 
for each dimension j = 1,... ,m we build the component of /em by an independent, n- 
dimensional polynomial interpolation near the point Xg. The coefficients of the polynomial 
are found simply by doing a weighted least squares fit to the k nearest neighbors, with weights 
u)* inversely proportional to the Euclidean distance. The details of the fit are discussed in 
Appendix A. 
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An n-dimensional polynomial of degree 6 will have N^oeS coefficients, with 


coeff — 1 “t“ n -j- 


n(n + 1) 


+ ••• + 


n(n + 1) • • • (n + <5 — 1) 

Jl ■ 


( 2 . 2 ) 


To avoid degeneracies we use k = 2Acoefr- It is easy to see that Acoefr increases exponentially 
with 6, so choosing a high dimensional polynomial will require a large number of nearest 
neighbors, as well as a least squares fit in a high-dimensional space, resulting in a uselessly 
slow emulator. Consequently, we have restricted our implementation to linear and quadratic 
polynomials. 


AddPointToTrainSet. — This function adds new points to the training set by first trans¬ 
forming to the Cholesky basis and then adding them to the appropriate location in the k-d 
tree. After adding many points the k-d tree may become unbalanced, resulting in a slowdown 
of the nearest neighbor search. After adding a new point we check if the tree has become 
unbalanced, and rebuild it if necessary. The criterion for this step is that the depth of the 
tree is four times larger than the depth of a perfectly balanced tree. This ensures that the 
tree always has an O(logAtr) depth. In practice, this rebuilding is sufficiently infrequent 
that it makes a negligible contribution to the overall computational cost of the algorithm. 


SimpleEmulator. — This is the simplest possible implementation of the above functions. 
After obtaining a training set, InitializeEmul builds the k-d tree with /em(x) computed 
by Emulate for a parameter space point x. 

Computational complexity. — If the dimensions n and m of the input and output spaces 
of fem, respectively, are much smaller than iVtr, the run-time of the algorithm depends 
primarily on the size iVtr of the training set. The complexity of InitializeEmul is dominated 
by the construction of the k-d tree, which is 0(Atr log Atr)- Finding the k nearest neighbors 
takes only 0(log Atr) time [30] and the interpolation step is independent of Atr, which fixes 
the complexity of Emulate. Finally, adding a new point to the k-d tree is linear in the 
depth of the tree, i.e., adding new points is 0(log Atr). However, the infrequent rebalancing 
step scales as 0(Atr log Atr). 


3 Error Analysis 

We emulate the function / in Eq. (2.1) with an approximate function /em using the method 
described in Sect. 2. However, we also want to estimate the difference between / and /em- 
This is critical both to the estimate of the total error in our calculation and when determining 
whether or not the approximation /em(x) is acceptable at a specific point x. 

We define the local emulation error A/ at a point x as 

^/(x) = /em(x) - /(x). (3.1) 

Unless X is an element of the training set Atrain we will not know /(x), so we treat A/ as a 
random variable with a probability distribution p{Af). It is also convenient to define a local 
error via a scalar function rj : ^ R, 

e^(x) = r/(/em(x)) - r/(/(x)). (3.2) 
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Algorithm 2 Building an empirical model for the local error on the scalar function t], 
using a set of exact calculations lev = /(-^cv) that the emulating function /em has not yet 
been trained on._ 

function BuildErrorModel(?7,Xcv = {x* : i = l,M},lcv = {yi = fi^i) '■ i 

Initialize an empty list R 
for all points Xj in Xq\ do 

Calculate p{^i)-, e.g., as the mean distance to the k nearest neighbors 
Add ratio |7/(EMULATE(xj)) — r]{yi)\/p{xi) to R, with Emulate from Alg. 1 
Create histogram H from R 

Smooth and normalize H to obtain an estimate of p{\er^\/p) 

procedure SimpleError 

Calculate sets Aev and lev = /(-^cv) exactly 

Set up the emulator by calling iNiTiALiZEEMUL(Xtrain) l^rain) 

Estimate p{\erj\/p) by calling BuildErrorModel(? 7, Aev, lev) 

For point x calculate p(e,,(x)) oc p(x) x p(|e^|/p) and renormalize 


The form of p is not unique, beyond requiring that it reduces the dimensionality of the output 
space of / to unity. Possible choices include a spatial average on the output space, p{f{x)) = 
(/(x)), or by evaluating the likelihood for experimental data, r/(/(x)) = p{D \ /(x))).^ 

We want an error model that is broadly applicable, computationally efficient, and up¬ 
dates automatically as the training set grows. We expect that p(A/) and p(e^) will be 
strongly dependent on the distances between x and its nearest points in Wrain, which are 
used for the interpolation. We will use a single parameter p(x \ Wrain) to define the proba¬ 
bility distribution p{erj \ p, Wrain)- We choose p to be inversely proportional to the density of 
the training set Wrain near x, typically the mean n-dimensional Euclidean distance to the k 
nearest neighbors in Xtrain- 

The form of p{er^ \ p, Xtrain) could be modelled using a priori knowledge of / and p, but 
it is far more flexible to estimate p{er^ \ p, Xtrain) empirically, via cross-validation on subsets 
Xcv of Xtrainj from which Orj Can be calculated exactly. We assume that increases linearly 
with p, allowing us to define a probability distribution on their ratio as 



(3.3) 


up to an arbitrary normalization. 

We describe our error model algorithm schematically in Alg. 2 and its detailed imple¬ 
mentation is given in Appendix B.3. The function BuildErrorModel takes as input the 
cross validation sets Xcv and Yev = /(Xcv)- For each point Xj in Xcv h calculates the lo¬ 
cal scalar error e^(xj) and the training set sparsity measure p(xj), using the mean Euclidean 
distance to the k nearest neighbors. We enforce our assumption oc p by constructing a 
smoothed and normalized histogram for the ratio |e,y|/p, from which we empirically estimate 
P(l®r)|/p)-^ SimpleError evaluates p(e^(x)) by inverting the transformation in Eq. (3.3), 

®We will typically assume that p(y) is inexpensive to calculate compared to /(x), if we are given y. If it 
is not, then we would ideally emulate the composite function f = go f directly. 

“^We also save the 1 a upper bound h\a on J3(|e,,|/p) for use in the learn-as-you-go algorithm. 
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Algorithm 3 Implementation of emulation scheme (Alg. 1) and error model (Alg. 2) into 
a learn-as-you-go method that progressively builds a training set Atrain, learns the error 
distribution p(e,j), and decides when to emulate or exactly evaluate /(x). 
function PREPAREEMULATOR(r7,Atrain,^train5</>Cv) 
if Size(Ati-ain) ^ Ajuijj then 

Move a fraction cpcY of the points in Atrain,^rain to Xqv, hcv (chosen randomly) 
Initialize the Emulator with iNiTiALiZEEMUL(Atrainj^rain) 

Build the Error Model with BuildErrorModel(? 7, Acvj^cv) 

Add Acv and Ycv back to Atrain and Itrain, respectively 
Save Size(Atrain) as CurrentSize 
function EmulationIsValid(x) 

Calculate /9(x), e.g., as the mean distance to the k nearest neighbors 

Define = hi^p where 6io- is the 1 a upper bound of histogram H from error model 

return true if < e, else return false 

function CalculateExact(x) 

Add X to Atrain and the exact value y = /(x) to Er a in 
if Emulator and Error model are initialized then 

Add X, y to training set using ADDPoiNTToTRAiNSET(x,y) 
if Size(Atrain) > (1 + 0err) X CurrentSize then 

Reset the Emulator and the Error Model to uninitialized 
Reinitialize them using PREPAREEMULATOR(7/,Atrain,Erain,0Cv) 

else 

if Size(Atrain) > Amin then 

Initialize using PREPAREEMULATOR(77,Atrain,Eram)<^Cv) 

return y 

function CalculateLAYG(x) 

if Emulator and Error Model are initialized then 

if EmulationIsValid(x) then return Emulate(x) 
else return yexact = CalculateExact(x) 
else 

return yexact = CalculateExact(x) 
procedure LearnAsYouGo 

Define an external statistical sampler StatSampler, e.g., Metropolis-Bastings 
for X drawn from StatSampler do 

Evaluate and save the output from GalculateLAYG(x) as y 


multiplying p{\er^\/p) by /o(x), allowing for e,, to be positive or negative with equal probability, 
and renormalizing. 

4 Learn-As-You-Go Techniques 

We now combine the emulator from Sect. 2 and the error model from Sect. 3 into our learn- 
as-you-go algorithm. We assume the use of an externally defined statistical sampler, e.g., 
Metropolis-Bastings or nested sampling. The details of the emulation methods are completely 
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independent of the sampling strategy. The resulting algorithm is outlined schematically in 
Alg. 3; its detailed implementation is discussed in Appendix B.4; and we describe the purpose 
of its separate functions below. 

PrepareEmulator. — If the size of the training set Aitrain exceeds a pre-defined threshold 
Amin; this function initialises the emulator. PrepareEmulator is called multiple times: 
first when the training set size reaches Amin and then every time the training set size grows 
by a fraction ^erRj as explained below. 

PrepareEmulator first divides the training set Atrain into a smaller training set 
"^train Atrain and a cross Validation set Acv by randomly choosing a fraction 0cv of points 
from Atrain and removing them from the training set. It is important to ensure that points in 
Acv do not coincide with points in Atrain, since points in Atrain have no emulation error by 
definition. The remainder of the training set is temporarily used to initialize the emulator, 
while Acv is used to initialize the error evaluation module. 

After computing the error model from Acv, the points in this set are added back to 
Atrain; further evaluations rely on the complete training set Atrain- Finally, CurrentSize 
stores the size of Atrain at the point when the emulator and the error model were last updated. 
When the size of Atrain increases by the fraction (/)err compared to CurrentSize, we update 
the emulator and error model. 

EmulationisValid. — This function determines whether or not the approximate value 
/em(x) should be accepted at a given point x. Following Alg. 2, we determine the value 610 - 
of the 1 a upper bound on the probability distribution function for the error ratio p(|e,j|// 9 ), 
for the scalar function ry. For a point x, EmulationIsValid calculates the average Euclidean 
distance p(x) to its k nearest neighbors in Atrain, then multiplies bi^ with p(x) to determine 
the la upper bound on e^(x). Emulation is accepted if this upper bound on the error, 
is less than a target error threshold e. 

CalculateLAYG and CalculateExact. — As in LearnAsYouGo, CalculateLAYG 
is called for any evaluation of /em(x). This function first checks if the error in the emulation 
scheme is acceptable. If it is, then the emulator from Alg. 1 is used to evaluate /em(x). If 
the error is not acceptable, CalculateExact is called to exactly evaluate /(x) and add x 
to Atrain and /(x) to Itrain- Every time a new point is added to Atrain, this function checks 
if the size of Atrain has grown by a fraction (/>err since the error model was last updated, in 
which case PrepareEmulator is called. 

LearnAsYouGo. — Given a pre-defined method for obtaining points x from some proba¬ 
bility distribution of interest, CalculateLAYG is called on x and its results are saved in a 
Markov chain. 

5 Propagating Local Errors to Posteriors 

Let us now specialize to the important case where the function /(x) in Eq. (2.1), which we 
intend to emulate, is the logarithm of a likelihood function 

/(x) ^ £(x) = logj2f(T)| x), (5.1) 

which is used by a statistical sampler to obtain posterior probability distributions. The 
learn-as-you-go Alg. 3 allows for estimation of the local error at each point x, and we will 
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translate this into a global uncertainty Ap(x | D) in the posterior probability distributions. 
In this section we will explicitly calculate this error and in Sect. 6 we will describe its use in 
cosmological parameter estimation problems.^ 

The posterior probability distribution of the parameter space vector x is related to the 
likelihood function .if(L) | x) by Bayes theorem; 


p(x I D) 


1 


Af(T>|x)p(x), 


where p(x) is the prior probability distribution for x, and 



Af(Z) I x)p(x) dx 


(5.2) 


(5.3) 


is the Bayesian evidence or marginalized likelihood. 

We will be interested in posterior distributions for an individual parameter xq, found 
after marginalizing over a set of nuisance parameters given by 


p(xo I D) 


j p{xo,6N\D)deN 


I xo, 0 n)p(xo, ^n) dON, 


(5.4) 


where Bayes theorem (5.2) was used in the second relationship. The error in the emulated 
marginalized posterior 

Ap{xo\D) = pem{xo\ D) - p{xo\D) (5.5) 

can be expressed in terms of emulated quantities as 


Ap{xo I D) 



Jp{xo,On) A^{D\xo,eN)deN 


Pem{xo I D) 

^em 


AZ{D) 


(5.6) 


to first-order in the error terms AJZ and AZ, where 


AZ{D) = 


J p{x) A^{D I x) dx. 


(5.7) 


The error in the Bayesian evidence AZ and data likelihood A££ are defined similarly to 
Eq. (5.5), where emulated quantities are evaluated using Hem from the learn-as-you-go emu¬ 
lation scheme of Sect. 4. 

Assuming the errors Ap and AAf are small, we can rearrange Eqs (5.6) and (5.7) as 

Alogp(xo|T)) = j Alog^{D\xQ,6N) ^iO]s[\xo)d6N (5.8) 


where we have defined the normalized probability distribution 

\ _ Peril (^^-O) I 7?) 

“ Pem{xo\D) ■ 

®Our publicly available code includes the posterior distribution error estimation, allowing the user to easily 
obtain the uncertainties on marginalized posterior distributions resulting from the emulation process. 
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The first term in Eq. (5.8) is the contribution to the posterior error from the local region 
around xq and the second term is the uncertainty in the overall normalization for p{xq \ D). 
The posterior error Ap{xo \ D) is proportional to the posterior itself, making the emulation 
error greatest near regions of large a posteriori probability. 

The learn-as-you-go emulation algorithm in Sect. 4 yields a sample of points iTgamp 
that are distributed according to pem(x| D), which we will use to evaluate Eq. (5.8). Given 
Xs am p we first calculate pem{xo\D) by histogram estimation, marginalizing over all other 
parameters and taking only the xq component of each element x E -Tgamp- Specifically, by 
taking only the xq dimension from the elements of iTgamp and dividing the result into A^bins 
bins, we can approximate p{xo \ D) by the fraction of the total sample that is in the bin to 
which Xq belongs, with the global normalization set by requiring f p(xo | D)dxo = 1. 

Since any Xq in the same bin as xq will have the same emulated posterior probability, 
we can obtain a sample of nuisance parameters ©samp that follow iiP{dj\f\xo) by taking the 
On from those points x' E Xgamp that have their Xq in the same bin as xq. We then estimate 
the integrals in Eq. (5.8) by 

A^bin Afgamp 

A logp(xo I D) PS ^ Ai{D I xo, 0%) - — - ^ Ai{D \ xj) (5.10) 

JVsamp 


for Xj E Asamp and 0^ G ©samp' 

With a probabilistic error model for A£(x) we can use the Lyapunov central limit 
theorem [31] to conclude that the result of each of the sums in Eq. (5.10) will be normally 
distributed, assuming that the A£ are uncorrelated for different points and both Abin and 
Asamp are sufficiently large. We expect the correlations in the Ai to be small in our learn-as- 
you-go emulation scheme, since the emulator is only used for points where the training set is 
dense. Consequently, two nearby points will typically have different interpolation functions 
over their nearest neighbors, resulting in an essentially independent error. For each emulated 
point X in Agamp, Alg. 2 will calculate the mean /i£(x) and variance a'J{x) of the probability 
distribution for Ai{x). For each point that is exactly evaluated, A^(x) = 0. 

Finally, the probability distribution for the error in the log-posterior approaches 

p{Alogp{xo\D)) ^[p,Q,ao], (5.11) 

where jV is a normal distribution with mean 

^ Wbin -y A^samp 

M X] “ AT- X] 

J''samp 

and variance 

^ A^bin -y A^samp 

^0 = + a72— ^ (5-13) 

-'''bin j=i -'''samp 

We have defined the error model in Sect. 3 so that pi = 0, which makes pQ vanish. This gives 
a simple way to evaluate the error in each bin of the marginalized posterior Pera{xo \ D) by 
combining the variances of the A^ for each point in the sample. By defining an upper limit 
cr^ax£ ™ allowed acceptable variance in the error model for A£, Eq. (5.13) bounds the 
variance in the error in the log-posterior as cJq < <7^ax£/-^bin) assuming Abin Agamp- 
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In practice we first evaluate the posterior probability distribution pem(a;o I D) from Xgamp 
by histogram estimation, as well as the local error in each bin, i.e., the first term in Eq. (5.10), 
ignoring the issue of normalization. Once the error is evaluated in each bin, we calculate the 
total normalization factor by summing all the bins and adding the errors of each bin according 
to the central limit theorem. We then divide each bin by the normalization factor. Keeping 
only first-order terms, we add the error coming from normalization to the error in each bin, 
to estimate the total error. 

6 CMB and Planck Likelihood 

We now apply this algorithm to CMB power spectrum and Planck likelihood calculations, 
using the 2013 likelihood code [2].® We show posterior probabilities for both emulated and 
non-emulated scenarios in Figs 2-3, including the Gaussian-distributed posterior error as 
calculated in Eq. (5.11). We compare different values of the tolerance on the cutoff for the 
allowed error in the log-likelihood as well as two standard sampling techniques to show the 
robustness of this approach. Finally, we discuss the speed-ups achieved. 

6.1 Details of procedure 

The Planck likelihood function is a product of several distinct independent likelihoods, the 
low-Z (Commander) and high-Z (CamSpec) temperature likelihoods, the leasing likelihood, 
and the WMAP polarization likelihood. Commander depends on the cosmological model 
through the CMB TT power spectrum only, while CamSpec is also a function of 14 nuisance 
parameters. The leasing likelihood takes as an input the TT and <h<h power spectra for the 
model, where is the leasing potential, and the WMAP polarization likelihood takes the 
TT, EE, TE, and BB spectra of the model as its input. 

The CamSpec likelihood can be evaluated relatively quickly and is the only part of the 
total Planck likelihood that depends on the additional nuisance parameters, so we exclude 
it from the emulation scheme to significantly reduce the dimensionality of the training set. 
Our approximation algorithm thus uses only the cosmological parameters as input param¬ 
eters and provides the Cf^, and Commander, lensing, and polarization likelihoods. Each 
set of cosmological parameters will be calculated using the learn-as-you-go approximation 
algorithm, which will decide whether or not to use the exact numerical solution for the out¬ 
puts. CamSpec is called with the interpolated values and the foreground parameters, 
completing the likelihood calculation. 

The number of cosmological parameters n depends on the cosmological model, but 
typically n ~ 1^(10).^ From Eq. 2.2 we choose quadratic interpolation between the nearest 
neighbors, for which we need 1 -|- n -|- n{n + l)/2 free parameters. 

We use the simple error model described in Sect. 3. We set (/)err = 0.05, which uses 
5% of the training set as a cross-validation set. However, when the cross-validation set size 
becomes larger than 1,000 we keep only 1,000 points. The function rj chosen for error eval¬ 
uation is the entire Planck likelihood function, evaluated from the output of the emulator. 
Specifically, the emulator calculates Cf'^ values, as well as the Commander, lensing, and 
polarization likelihoods. The function r/ passes the Cf^ and the best-ht values of the fore¬ 
ground parameters to CamSpec. The result of this is combined with the other likelihoods. 

®Our public code can be easily used with the new Planck likelihood code when it becomes available. 

^For ACDM the number of parameters is 6. 
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Par am. 

MCMC exact 

MCMC approx. 

MultiNest exact 

MultiNest approx. 


0.02210 ±0.00028 

0.02210 ±0.00029 

0.02210 ± 0.00028 

0.02210 ± 0.00028 


0.1196 ±0.0027 

0.1196 ±0.0027 

0.1195 ±0.0026 

0.1195 ±0.0027 

h 

0.680 ±0.012 

0.680 ±0.012 

0.0680 ± 0.012 

0.0680 ±0.012 

T 

0.089 ±0.013 

0.089 ±0.013 

0.089 ±0.013 

0.089 ± 0.013 

rig 

0.9613 ± 0.0074 

0.9614 ± 0.0075 

0.9613 ±0.0071 

0.9615 ± 0.0074 

Ag 

3.087 ±0.025 

3.088 ±0.025 

3.087 ±0.026 

3.088 ± 0.025 

logZ 

- 

— 

-4944.0 ± 0.3 

-4944.1 ±0.3 


Table 2. Parameter constraints from MCMC and MultiNest samplers with and without using the 
emulator. 

We choose the weighted average distance to the nearest neighbors as the measure p of the 
local sparsity of the training points. 

The acceptable amount of error in Ai is defined by the ratio r 2 a at which the cumulative 
distribution equals 0.955. For each set of input parameters we then find its k nearest neighbors 
in Xtrain, the weighted average Euclidean distance to them, and multiply the distance by r 2 a 
to find the 2 a upper bound on the error at that point. We choose a threshold of 0.4 for the 
2cJ upper bound on the error of —2 log .if, which corresponds to a 0.1 threshold for the Icr 
upper bound on log .if for Gaussian errors. We use the emulator only if the 2cj error at a 
point is below this threshold; otherwise the likelihood is calculated exactly. Emulation is not 
attempted until there are at least A^min = 10, 000 points in the training set. 

The training set is continuously updated throughout the run, with each new point 
being immediately added to a k~d tree. We re-balance the tree as needed, using the criterion 
described in Sect. 2. We update the error model every time the training set increases in size 
by 25%, since likelihood evaluations for points in the cross-validation set are relatively time 
consuming. 

We use a parallel implementation in which all threads share their training sets each 
time a single thread has accumulated 10 new points. When using many independent Markov 
chains this gives a significant performance boost, since chains that start out in different 
regions of the parameter space soon explore regions other chains have visited. Eurthermore, 
we find that we frequently sample the same regions of cosmological parameter space with 
different values of the foreground parameters. Since these are used in the relatively fast 
CamSpec likelihood module, we have effectively implemented a fast-slow split analogous to 
that provided by CosmoMC [32], but without modifying the sampler itself. 

6.2 Emulated ACDM posteriors 

In Fig. 2 we plot the posterior distributions for the cosmological parameters of the standard 
ACDM model using the Planck 2013 temperature data together with WMAP polarization. 
We use the CoSMO-|--|- MCMC sampler and compare the posterior distributions of the cos¬ 
mological parameters h, r, Ug, and Ag with and without emulation. We also 
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Figure 2. Posterior distributions of the cosmological parameters from MCMC sampling with (solid 
blue lines) and without (dashed red lines) emulation. The 1 cr error of the posteriors for the case of 
emulation is shown as a gray band around the solid blue lines. 


performed this comparison for the MultiNest sampler, with essentially identical results. 
Fig. 3 shows the posterior of one parameter, obtained with MCMC (left panel) and 

MultiNest (middle panel). In all cases we start with an initially empty training set. 

We plot the la error range on the posteriors using the method in Sect. 5 with gray 
bands around the solid blue lines corresponding to the mean prediction. However, the errors 
are negligible and only visible if we magnify the axes substantially. Note that all posteriors 
include additional errors resulting from the finite sizes of the samples used to estimate the 
distributions, which are not included in our estimation of the errors. The uncertainty in the 
posterior induced by the emulation algorithm is smaller than this sampling uncertainty; and 
the posteriors with and without emulation agree very well. 

The right panel of Fig. 3 demonstrates a more substantial error on the posterior dis¬ 
tributions by repeating the MCMC run with only two chains and a higher threshold for the 
error. We have allowed points to be emulated whenever the 2cr upper bound on —2 log .if 
is less than 1.0 instead of 0.4. Using only two chains compared to 10 decreases the number 
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Figure 3. Posterior distribution of Vlbh? with (solid blue lines) and without (dashed red lines) using 
emulation. The 1 a error of the posteriors for the case of emulation is shown as a gray band around 
the solid blue lines. The left panel is for MCMC (identical to the top left plot in Fig. 2, the middle 
panel is for MultiNest, and the right panel is for MCMC again but with a larger error threshold 
(see the discussion in the text). 


of points in the sample by a factor of 5, and increases the error bars by a factor of \/5 (see 
Eq. (5.13)). The error band for the posterior is wider than in the previous cases (left and 
middle panels) but is still comparable to the errors coming from the finite size of the sample. 

The 1 a confidence intervals for the cosmological parameters and Bayesian evidences® 
are summarized in Table 2. The results with emulation are almost identical to the exact case. 
Specifically, the confidence interval widths for the parameters vary by less than 5%, and the 
central values vary by less than (t/ 20. Such small numerical differences can arise even if two 
runs are performed with the exact same likelihood, so the emulated likelihoods involve no 
sacrifice of accuracy with this set of run parameters. 

Figure 4 shows the estimated probability distribution of the errors p{\er^\/p) for all the 
stages of error evaluation during the MCMC run. The error model is periodically updated 
every time the training set size increases by a fraction </>err, which we have chosen to be 
25%. The red solid line shows the first evaluation, the blue solid line shows the last one, and 
the dashed lines show the intermediate stages. The increase of the training set size results 
in distributions with smaller expected variance for p(|e^|//9) and a progressively higher peak 
at ^r]!P = 0. 

6.3 Estimated improvement 

In Table 3 we have estimated the speedup from using the emulator for a single likelihood 
calculation with a well-developed training set of 150, 000 points. We perform 100 exact 
calculations of the CMB power spectra and Planck likelihoods to evaluate the average run¬ 
time for each step and compare this to 100 rapid calculations using the emulator. We estimate 
the run-times on two different architectures, Darwin with 4 CPU cores, and Linux with 16 
CPU cores, using GNU compilers on Darwin and Intel compilers with the MKL library on 
Linux. 

^Computed using MultiNest. 
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Figure 4. Probability distribution of |e,,|/p for different stages of error model evaluation (see Section 
3) for the MCMC run. The labels show the size of the training set. 


The slowest calculation is the CMB power spectrum. The polarization and lensing like¬ 
lihood calculations are roughly as fast as the CMB power spectra calculation; Commander 
and CamSpec are approximately one and two orders of magnitude faster, respectively. The 
emulated likelihood approximation is only slightly slower than CamSpec, meaning that the 
emulation itself takes a negligible amount of the run-time. Furthermore, since the compu¬ 
tational complexity depends only logarithmically on the training set size Nt,-, changing Nt,- 
will have a negligible impact on the overall speed. 

The speedups shown in Table 3 are only achievable after obtaining a large training 
set. However, in the learn-as-you-go scenario we start with an empty training set with the 
algorithm training itself with exact likelihood evaluations. Gradually the estimated error 
in the emulator will decrease with an increasing size for the training set and the estimated 
likelihood values will be reliable for new points. We will now estimate the overall speedup that 
can be gained using the learn-as-you-go approach separately for the MCMC and MultiNest 
samplers. 

We start both samplers with an empty training set and compare the results with the 
case of no emulation. We run MCMC using adaptive posteriors with 10 parallel chains. The 
burn-in length is chosen to be 500 and the Gelman-Rubin convergence diagnostic [25, 33] 
is used with a convergence criterion of 1.01. Convergence occurs when the chains reach a 
length of approximately 6, 000 in both cases, with or without emulation. When no emulation 
is used and all of the processes are run on nodes of similar speeds the resulting chains have 
similar sizes. However, when using the emulator, different chains may progress at significantly 
different speeds depending on how frequently they walk into dense regions of the training set. 
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Calculation step 

Linux + intel (16 cores) 

Darwin + GNU (4 cores) 

CMB 

0.847 sec 

2.703 sec 

CamSpec 

0.008 sec 

0.044 sec 

Commander 

0.062 sec 

0.167 sec 

Polarization 

0.190 sec 

1.035 sec 

Lensing 

0.320 sec 

1.005 sec 

Rapid likelihood 

0.014 sec 

0.050 sec 

Speedup factors 


Temperature only 

67 

57 

Temp. + pol. 

81 

78 

Temp. + lens. 

91 

78 

Temp. + pol. + lens. 

105 

99 


Table 3. The average times of different steps of likelihood evaluation, determined from 100 repeated 
calculations (top). The speedup factors by using the emulator with 150, 000 training points (bottom). 




Figure 5. The ratio of the number of emulated to the number of exact likelihood evaluations as a 
function of the total number for the case of MCMC sampling (left) and MultiNest sampling (right). 
Different colors represent 5 randomly selected chains. 


resulting in chains of significantly different lengths. Convergence in this case is determined 
by the size of the slowest chain. MultiNest can also be parallelized very well, so we run 
it with 10 MPI processes and the recommended values of the other parameters for accurate 
evidence calculation [28]. 

We plot the ratio of the number of emulated to the number of exact likelihood calcu- 
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lations in Fig. 5. The left panel has 5 selected MCMC chains and the right panel has 5 
MPI processes for nested sampling. The different MCMC chains have significantly different 
lengths so we cut off all of the lines to match the smallest chain. 

These plots illustrate important characteristics of our learn-as-you-go approach. At the 
beginning of the run the ratio of emulated points starts and remains close to zero as the 
emulator builds a training set. However, once the emulator achieves a high-enough accuracy 
the ratio of emulated points increases roughly linearly. This divides the run into “mainly 
training” and “mainly emulation” periods for each chain, with a transition occuring after 
about 5, 000 total evaluations per chain for the MCMC case and after about 10,000 for 
MultiNest. In the “mainly emulation” period, the ratio of emulated to exact calculations 
depends linearly on the total number of evaluations with a slope of a « 2.11 x 10 ^ for 
MCMC and a ~ 7.5 x 10“^ for nested sampling. With a slope of a, the total number of 
exact evaluations per chain Aex will eventually saturate at Aex = 1/a as the total number of 
evaluations gets large, which is approximately 4.8 x 10^ and 1.3 x 10^ for MCMC and nested 
sampling, respectively. These upper limits are nearly coincident with the end of the “mainly 
training” period, indicating that there are almost no exactly evaluated calculations required 
during the “mainly emulation” stage. 

MultiNest implements nested sampling, so it starts in regions of small likelihoods and 
gradually converges to the peak of the likelihood [27-29]. This means that the training set is 
initially spread thinly across the parameter space, making emulation impossible. However, 
MultiNest eventually converges on the peak of the likelihood, where the training set be¬ 
comes sufficiently dense. The nested sampler then remains in the high likelihood regions, 
resulting in a more abrupt transition from “mainly training” to “mainly emulation” periods 
for MultiNest than for MCMC, as can be seen in Fig. 5. 

For MCMC sampling, the ratio of emulated to exact calculations is approximately 19 at 
the end of the run for the worst case chain and emulation was performed for about 95% of the 
points. Since the speedup factor for the temperature plus polarization likelihoods is about 
80 in Table 3, an overall speedup factor of about 16 would be expected if there were no other 
calculations performed. In practice, the wall-clock run-times for the MCMC calculations 
improved by a factor of 6.5 when run with emulation and an initially empty training set. For 
MultiNest, Fig. 5 shows that the final ratio of the number of emulated to the number of 
exact calculations reaches about 2.5-3.5 for different processes, compared to 19-20 for the 
MCMC sampler. For all the 10 chains, emulation was performed for 75% of all likelihood 
evaluations. Without any computational overhead, this would give an expected speedup of 
3.9; in practice we obtained a value of 3.5. 

7 Conclusion 

We have incorporated an explicit error model in a flexible learn-as-you-go emulation algo¬ 
rithm. When the slow calculation /(x) is a data likelihood function we have also shown that 
the estimated local error in the log-likelihood can be easily propagated into estimates of a 
model’s posterior probability distributions. Using a trustworthy error model then removes 
any need for performing redundant and slow comparisons to exact calculations. 

In Sect. 2 we described an emulation algorithm that approximates an arbitrary function 
/ : M”" —>■ at a parameter space point x by using polynomial interpolation over the k 

nearest neighbors to x in a pre-computed training set Aitrain- In Sect. 3 we implemented a 
simple error model to approximate the local difference between /(x) and its emulation /em(x). 
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Given an arbitrary measure p, which is inversely proportional to the density of the training set 
-Strain at the point X, our model requires the error to scale like A/(x) ~ p. We have chosen 
p to be the weighted mean Euclidean distance to the k nearest training points, although 
this could be relaxed in a more general error model. We empirically estimate a probability 
distribution p(A/(x)) by cross-validation on a subset Acv of the pre-computed training set 
-Strain- Sect. 5 Specializes to the case where we emulate a cosmological likelihood function 
^ and presents a method for propagating the local error to the error in marginalized 
posterior probability distributions. Since the error model is probabilistic, we can then report 
a probability distribution for the value of the posterior probability p{xo \ D), which tends to 
a normal distribution by the central limit theorem. 

Improving on Ref. [14], Sect. 4 implements a pre-defined tolerance e on the allowed 
error A/(x) in the emulated function. Parameter values that have emulation errors with 
a significant probability to be above a pre-defined threshold are directly computed, instead 
of emulated. Points that are calculated exactly are added to the training set Atrain and 
the error model is updated repeatedly based on the increasing size and range of Atrain- By 
adjusting this threshold, the algorithm emulates more or less liberally, with a corresponding 
trade-off between run-time improvement and accuracy. The specific details of the emulation 
algorithm, the error model, and the learn-as-you-go scheme can be understood from Algs 1, 
2, and 3, respectively. 

We have applied this methodology to ACDM parameter estimation from CMB likeli¬ 
hoods in Sect. 6. We chose CMB likelihoods because they are sufficiently slow to calculate 
that we see a moderate speed improvement when emulating, but are fast enough that we 
can exactly evaluate a significant fraction of the CMB Markov chains to develop a dense 
training set. Section 6.2 and Figs 2-3 have shown that the error in the emulation scheme is 
sub-dominant to the finite-sample statistical error when we choose typical cutoff values on 
the probability distribution of the error in —21ogAf. Section 6.3 approximates a speed-up 
of 0’{3 — 10) without the use of an initial training set with either MCMC or nested sam¬ 
pling methods. The learn-as-you-go algorithm that we have developed is parallelized, fast, 
and applicable to general situations outside of CMB likelihood emulation. We have publicly 
released all of our code in the CoSMO-|--|- package and have included the implementation 
details in Appendices A and B. 
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A Least Squares Fit 

This appendix discusses the details of the weighted least squares fit of Alg. 1 in Sect. 2, which 
is performed by mimimizing the sum 

5,=^rcV:,(xr)-(yn,p. (A.l) 

i=l 

®http://www.nesi.org.nz 
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The index j denotes the output coordinate of the target function / and the index i runs 
over all the neareast neighbors. The j-th component of is The weights are 

inversely proportional to the distances of the query point Xg to the nearest neighbors 
For convenience, we choose the polynomial to have the following form: 

n 

Pj (x) = Qj + (x — Xq) ^ + • • • + terms of degree 5 , (A.2) 

i=i 

where the upper index I denotes the Tth component of a vector. Since we will only need to 
evaluate Pj for Xg, we will only need to calculate the value of Oj. 

The coefficients of the polynomial Pj can be written as a vector 

Pj = (oj, ... ,hj ,..., terms of degree 5)"^ , (A.3) 

and the polynomial itself can be written in the form 

-Pj(x) = Pj • z, 

where z contains all of the products of the elements of (x — Xg) up to the degree 5 
z = (1, (xi - (xg) J ,..., (x„ - (xg)^) ,..., terms of degree <5) . 

The values of the parameters pj that minimize Sj can then be calculated by 

Pj = {Z^WZr^Z^Wy'j , 

where 



(zr\ 


( 

0 \ 




z = 


, w = 

1 0 

j 


UynJ 

(A.7) 


and each z?^'^ is calculated by substituting x^^ for x in (A.5). All of the matrices in Eq. (A.6) 
are independent of the output values of /. Consequently, the computationally costly matrix 
operations, including the inversion, need to be done only once even for a large dimensionality 
m of the output space of /. 

B Implementation in CosmoH—|- 

The tools described in this paper have been implemented in C++ and made publicly available 
as a part of the CoSMOH — h package. In this appendix we describe the different modules 
available and how to use them. CoSMOH — h includes full documentation for all of these 
modules. 

B.l k—d tree 

The class KDTree in kd_tree.hpp implements a simple k-d tree functionality. The tree can 
be built by passing a set of points to the constructor. The additional functionality includes 
inserting new points using the function insert, resetting the tree using reset, re-balancing the 
tree with the function reBalance, and finding the k nearest neighbors of a given point using 
findNearestNeighbors. This functionality is enough for using the k-d tree in the emulation 
algorithm described in this paper. The current implementation uses the Euclidean distance, 
but it can be extended to include other metrics. 


(A.4) 

(A.5) 

(A.6) 
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B.2 Emulator 


The emulator algorithm described in Section 2 has been implemented in the class FastAp- 
proximator in the include file fast_approximator .hpp. The constructor takes as input the 
dimensionality of the input and output spaces of the function / that is being approximated, 
the training set as two arrays of input and output values of /, and the number k of nearest 
neighbors to use in the approximation.^'^ The approximation for a new input point can be 
performed with the approximate function, which takes as an input the new point, and the 
interpolation method (currently linear and quadratic interpolations are implemented), and 
returns as an output the approximated value of the function.If desired, the function can 
also return the k nearest neighbors, distances to the nearest neighbors, and the indices of the 
nearest neighbors in the training set. 

The function approximate is implemented in two steps, and each of the steps can be 
called separately as a function. The first step involves finding the k nearest neighbors, 
which is implemented in the function findNearestNeighbors. This function takes the new 
input point and finds the nearest neighbors. If desired, the nearest neighbors themselves, 
the distances to them, and their indices in the training set can also be returned. After this 
step, the function getApproximation can be called, which takes as an input the interpolation 
method and returns the approximated value. Note that this function can only be called after 
findNearestN eighbors. The separation of the approximation algorithm into two steps can 
be useful in many cases. For example, the error evaluation model, such as in Sect. 3, may 
only depend on the characteristics of the nearest neighbors. So the nearest neighbors can be 
found as a first step, the error can then be evaluated and only if the error is small enough 
for the approximation to be acceptable can the interpolation be performed. This approach 
can save computational time by not calculating the interpolation, which involves matrix 
operations and can be somewhat slow, unless it is necessary. Another useful application 
could be obtaining both linear and quadratic interpolations without repeating the nearest 
neighbor search. If none of this is necessary then the two step approximation can be replaced 
by a single call to the function approximate. 

New points can be added to the training set with the function addPoint.^^ This function 
adds the new point to the fe-d tree, without recalculating the covariance matrix, and checks 
if the k-d tree is very unbalanced after the insertion, in which case the tree is rebalanced by 
simply rebuilding it. The criterion for the tree being unbalanced is that the depth becomes 
4 times larger than the depth of a balanced tree with the same number of elements (see 
Sect. 2). 

The entire training set can be updated using the reset function, which takes as an input 
the new training set. The user can choose whether or not to update the covariance matrix of 
the input points. For example, if the new training set is a superset of the old one with not 
many additions, it may not be necessary to update the covariance matrix. Also, updating the 
covariance matrix will alter the distances to the already existing points, which may further 
impact the error evaluation model described below. So if one is simply adding a few new 
points to the training set without recalibrating the error model, then the covariance matrix 
should not be changed. 

'^^The constructor is the equivalent of the function InitializeEmul in the schematic Alg. 1. 

^^The approximate function is equivalent to Emulate in Alg. 1. 

^^Equivalent to AddPointToTrainSet in Alg. 1. 
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B.3 Error Model 


The simple error model described in Sect. 3 has been implemented in the class FastApprox- 
imatorError in fast_approximator_error .hpp. The input to the constructor consists of a 
reference to a Fast Approximator object, a set of test points (the cross-validation points), a 
function /err for which the error is modelled, the error modelling method, the target accuracy, 
and the decision method}^ We describe each of these parameters in detail below. 

The function rj : M™' —>■ M takes the output of / and maps it to a single number. This 
function is used for modelling the error (see Sect. 3). For the Planek results described in 
Sect. 6 the fast approximation algorithm outputs a set of values as well as a few of 
the Planck likelihoods, which can then be translated to the total Planck likelihood. In this 
case, the decision of whether or not to use the approximation should be based on the error 
of the total likelihood, rather than the errors of individual Ci values or individual likelihood 
components. The function rj should then take the complex output of the approximation 
algorithm and translate it to the total likelihood. 

As described in Sect. 3, the probability distribution p of the error e,, is modelled via 
a distribution that depends on the ratio of Crj to a single parameter p as in Eq. (3.3). The 
assumption is that the error increases linearly with p, where p can be calculated rapidly for 
any point in question and its nearest neighbors. Alg. 2 uses the average Euclidean distance to 
the nearest neighbors for p. The error modelling method for Fast Approximator Error simply 
refers to the choice of p in general. The following options are currently implemented: 

• MIN_DISTANCE - The distance to the nearest neighbor. 

• AVG-DISTANCE - The average distance to the k nearest neighbors. 

• AVG-INV-DISTANCE ~ The weighted average distance to the k nearest neighbors, 
with weights inversely proportional to the distances, which is useful when performing 
a weighted interpolation between the nearest neighbors. 

• SUM_DISTANGE - The geometrical sum of the distances to the nearest neighbors. 
This is a useful measure of both how far the nearest neighbors are on average and how 
isotropically they are distributed around the given point. 

• LIN_QUAD_DIFE- The difference between linear and quadratic interpolations between 
the nearest neighbors. 

The test set (also referred to as the cross-validation set Acv) is used to estimate the 
distribution of this ratio. For each point in the test set, the measure p is calculated, the ap¬ 
proximate and exact values of p are calculated, and the ratio 16,^1//? is obtained. A histogram 
is then built using these ratios for all of the test points, which is further smoothed with a 
Gaussian kernel and normalized. 

For any new point the parameter p is calculated, which is used to rescale the histogram 
above and obtain the distribution p for the error e^^ at that point. The decision is made 
whether or not the approximation is acceptable, by translating the distribution to a single real 
number and comparing to the target threshold. The choice of the single number representing 
the distribution is made based on the decision method parameter passed to the constructor. 
The following choices are currently implemented: 

^®The constructor is equivalent to the BuildErrorModel function in the schematic Alg. 2. 
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• ONE_SIGMA - The 68.3% upper bound of the absolute value of the error. 

• TWO-SIGMA ~ The 95.5% upper bound of the absolute value of the error. 

• SQRT- VAR - The square root of the variance of the error. 

The approximation for a new point is performed using the function approximate. The 
approximation itself is performed by passing the point to the underlying Fast Approximator. 
The return value of this function indicates whether or not the approximation is acceptable. 
The function can also return the 68.3%, the 95.5% upper bounds of the error, as well as the 
variance and the mean, if any of those are requested. 

The error model can be recalibrated with a new test set using the reset function. The 
setPrecision function can be used to change the target accuracy of the model {i.e., the 
threshold for the error). The distribution for the ratio \erj\/p can be obtained using the 
getDistrib function. 

B.4 Learn-As-You-Go 

The LearnAsYouGo class in learn_as_you_go. hpp implements the “learn-as-you-go” func¬ 
tionality described in Sect. 4, using the Fast Approximator and FastApproximatorFrror classes 
described above. The purpose of LearnAsYouGo is to decide for each input point whether or 
not it should approximate or do the calculation exactly. If the exact calculation is performed 
then the result is also added to the training set. 

The input parameters to the constructor are the dimensionalities of the input and 
output spaces of the function / being calculated, a reference to /, the error model function 
r], the minimum number of points in the training set for which approximations are allowed, 
the error threshold e, as well as a file name in which the training set can be saved. The 
calculation is performed using the evaluate function,which will return either an acceptable 
approximation for the value of / at the given point, or its exact value. If the exact value 
is calculated and returned then it is automatically added to the training set. The target 
threshold e can be changed at any time with the setPrecision function. The training set can 
be saved into a file using writeIntoFile and retrieved using readFromFile. The progress can 
be continuously logged into a file by setting the log file using the logIntoFile function. The 
log will include the total number of function calls, the number of calls for which the input 
point already exists in the training set, the number of calls with successful approximations, 
and the number of calls with unsuccessful approximations, i.e., exact calculations. 

An important characteristic of LearnAsYouGo is that it has a parallel implementation. 
If an application using LearnAsYouGo is run with many MPI processes, then each process 
will have its own “learn-as-you-go” approximation class. However, all of the processes will 
periodically share their training sets with each other, so effectively each process will use as a 
training set all of the exact function calculations by all of the processes. The user does not 
need to do anything special to take advantage of this functionality. All that is required is to 
run many parallel MPI processes. 

B.5 Accelerated Planck Likelihood 

Finally, we use the LearnAsYouGo class to implement an accelerated “learn-as-you-go” 
Planck likelihood in the PlanckLikeFast class in planck_like_f ast .hpp. The interface is 

Equivalent to CalculateLAYG in Alg. 3. 
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similar to the Planck likelihood interface in COSMOH — h [25]. The implementation uses the 
LearnAsYouGo class with the values and the slower Planck likelihoods as the function to 
be approximated (see Sect. 6), and the total Planck likelihood function to model the error. All 
of the functionality of LearnAsYouGo described above, including the parallel implementation, 
is present in PlanckLikeFast. Examples of using PlanckLikeFast in parameter space samplers 
are implemented in the test_mcmc_planck_f ast. cpp and test_multinest_planck_f ast. cpp 
files for Metropolis-Hastings and MultiNest samplers, respectively. Those implementations 
are used to obtain the results in Sect. 6. 

The constructor takes as an input a pointer to a GosmologicalParams object, which 
is used for setting the cosmological parameters, boolean (logical) values defining which of 
the likelihoods need to be included and whether or not tensor modes are included, as well 
as the number of points per decade in k space for which the primordial power spectrum is 
evaluated. Following these parameters, two parameters for the “learn-as-you-go” algorithm 
must be specified. These parameters are the target accuracy (the threshold for the error 
used to decide whether or not the approximation is acceptable), and the minimum size of 
the training set before approximations are performed. 

The calculation is done through the calculate function. This function enables the use of 
the PlanckLikeFast class in parameter space samplers. Since parameter space sampling is the 
main application for the “learn-as-you-go” likelihood, the detailed interface for calculating 
each component of the likelihood separately is not implemented here. The regular Planck 
likelihood interface PlanckLikelihood can be used for that purpose as in Ref. [25]. With this 
interface, it is very easy to replace the usual Planck likelihood code by the accelerated one 
in any parameter space sampling code. Simply replacing the PlanckLikelihood object with a 
PlanckLikeFast object initialized with all of the necessary parameters will suffice to use the 
accelerated likelihood code in any parameter space sampler present in CoSMO-|--|-. 

Since the function being approximated in this case is a likelihood function, we can apply 
the techniques developed in Sect. 5 to estimate the errors of the posterior distributions for 
the parameters of interest. Our implementation of PlanckLikeFast allows easy calculation 
of the posteriors with errors as follows. Before starting the parameter space sampling an 
error log file must be set using the logError function of PlanckLikeFast. Once this is set 
every call of calculate will output the call information into the error log file, which includes 
the parameters for the function call, the resulting likelihood, and estimated characteristics 
of the error, which can then be used to estimate the errors of the posteriors. Whenever the 
calculate function does the exact calculation the errors will be zero. There is no modification 
to the chain files produced by the parameter space samplers. After the sampling is finished, 
the MarkovChain class can be used to analyze the chains and obtain the posteriors. The new 
version of MarkovChain allows for calculation of errors of the posteriors by simply passing 
the error log hie produced by PlanckLikeFast as an extra argument to the constructor. In 
summary, all that is needed for producing posteriors with errors is generating an error log 
hie in PlanckLikeFast and then passing it to MarkovChain for analyzing. 
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